Table Header Detection and Classification

Authors

  • Jing Fang
  • Prasenjit Mitra
  • Zhi Tang
  • C. Lee Giles
Abstract

In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is a long-standing field of research. However, because table styles are so diverse, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we take random samples from CiteSeer to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to separate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that a Random Forest classifier achieves an accuracy of 92%.
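As a rough illustration of what "features that separate headers from tabular data" could look like, the sketch below computes a few row-level features in Python. The abstract does not list the paper's actual features, so the feature names, the `row_features` helper, and the toy table are all hypothetical; they only show the general idea that header rows tend to be textual while data rows tend to be numeric.

```python
import re

# Hypothetical pattern for "numeric-looking" cells (e.g. 12, -3.5, 40%).
# This is an illustrative assumption, not the paper's definition.
_NUMERIC = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)%?$")


def row_features(cells, row_index):
    """Return a small feature dict for one table row (a list of cell strings)."""
    non_empty = [c.strip() for c in cells if c.strip()]
    n = len(non_empty)
    numeric = sum(1 for c in non_empty if _NUMERIC.match(c))
    alpha = sum(1 for c in non_empty if any(ch.isalpha() for ch in c))
    return {
        "row_index": row_index,                            # headers usually sit near the top
        "numeric_fraction": numeric / n if n else 0.0,     # share of numeric cells
        "alpha_fraction": alpha / n if n else 0.0,         # share of cells containing letters
        "empty_fraction": 1 - n / len(cells) if cells else 0.0,
    }


# Made-up toy table, for illustration only.
table = [
    ["Year", "Count", "Share"],
    ["2010", "12", "3.5%"],
    ["2011", "", "4.0%"],
]
features = [row_features(row, i) for i, row in enumerate(table)]
```

Feature vectors of this kind could then be passed to any supervised classifier, such as the Random Forest mentioned in the abstract, with each row labeled as header or data.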


Similar resources

ScanTab: Table Recognition by Reference Tables

The ScanTab system represents a knowledge-based approach to table recognition in scanned documents. In contrast to most systems which recognize tables by grouping layout information, our system uses predefined information about which table types may appear. This enables a very accurate detection able to cope with distorted tables and tables providing little layout information, e.g., no lines, b...


Header Processing Requirements and Implementation Complexity for IPv4 Routers

Keywords: header processing, IPV4, IPV6, internet, networking, router performance, packet classification. The complexity of IP header processing is commonly believed to be the main limiting factor for router performance. The purpose of this report is to explore header processing requirements for IP routers, identify different implementation alternatives and discuss their complexity. The conclusion of the ...


Conflict Detection in Internet Router Tables

Preamble. Packet filters are rules in IP router tables for classifying packets based on the information in their header fields. For forwarding purposes, there has to be a unique best matching filter which applies to an incoming packet p. In order to avoid ambiguities in the classification, the set of filters must be conflict-free under the tie-breaking rule which is applied. In this report we e...


Efficient Header Classification Architecture for Network Intrusion Detection

In this paper, an efficient FPGA-based header classification circuit is proposed for network intrusion detection systems (NIDS). The circuit is based on simple shift registers and symbol encoders for fast packet header classification in hardware. Compared with related work, experimental results show that the proposed work achieves higher throughput and uses fewer hardware resources in the FPGA i...


Classifying Malicious Windows Executables Using Anomaly Based Detection

By Ronak Sutaria. A malicious executable is broadly defined as any program or piece of code designed to cause damage to a system or the information it contains, or to prevent the system from being used in a normal manner. A generic term used to describe any kind of malicious software is Malware, which includes Viruses, Worms...



Publication date: 2012